IEEE/CVF International Conference
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Zhimin Chen
Foundation models have achieved remarkable results in 2D and language tasks such as image segmentation, object detection, and vision-language understanding. However, their potential to enrich 3D scene representation learning remains largely untapped because of the domain gap. In this work, we propose Bridge3D, a methodology that addresses this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process of the masked autoencoder, enabling more focused attention on foreground representations.
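The mask-guidance idea in the abstract could be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `foreground_guided_mask`, the mask ratio, and the foreground weight are all illustrative assumptions; it simply biases the masked-autoencoder token selection toward tokens a semantic mask flags as foreground.

```python
import random

def foreground_guided_mask(fg_labels, mask_ratio=0.6, fg_weight=3.0, seed=0):
    """Sample token indices to mask, biased toward foreground tokens.

    fg_labels: sequence of 0/1 flags per token (1 = foreground, e.g. from
    a 2D semantic mask projected onto the 3D tokens). Foreground tokens
    are fg_weight times more likely to be chosen for masking, so the
    reconstruction objective focuses on foreground structure.
    """
    rng = random.Random(seed)
    n_mask = int(len(fg_labels) * mask_ratio)
    weights = [fg_weight if f else 1.0 for f in fg_labels]
    # Weighted sampling without replacement via the exponential-key trick:
    # draw key_i ~ Exp(w_i); the n_mask smallest keys are the sample.
    keys = [rng.expovariate(w) for w in weights]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return sorted(order[:n_mask])
```

With `fg_weight > 1` the sampled mask concentrates on foreground tokens on average while still occasionally masking background, which keeps the reconstruction task from collapsing to foreground-only inputs.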
- Asia > Middle East > Israel (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.70)
- Information Technology > Sensing and Signal Processing > Image Processing (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Appendix 1: Back-imagination and Back-speech
Figure 1: Illustrative examples of the two proposed techniques, Back-imagination and Back-speech.

Tiny ImageNet [Le and Yang, 2015] serves as a compact version of the comprehensive ImageNet dataset. The Stanford Sentiment Treebank-2 (SST-2) [Socher et al., 2013] is a sentiment classification dataset. Given the scarcity of datasets for understanding natural language in visual scenes, we introduce a novel textual entailment dataset, named Textual Natural Contextual Classification (TNCC). This dataset is built on the foundation of Crisscrossed Captions [Parekh et al., 2020], an image-caption dataset. In this work, we employ a uniform experimental configuration for both the textual entailment and sentiment classification tasks. For the image classification task, we employ the ResNet18 [He et al., 2015] model, which is considered more suitable for small datasets.
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.57)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.55)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.55)
- Information Technology > Artificial Intelligence > Vision > Image Understanding (0.35)
DynPoint: Dynamic Neural Point For View Synthesis
These estimates are subsequently used to aggregate information from the reference frames into the target frame. Hierarchical neural point clouds are then constructed from the aggregated information and employed to synthesize views of the target frame.
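The aggregate-then-build-hierarchy step could be sketched as below. Everything here is an illustrative assumption rather than DynPoint's actual pipeline: the function `aggregate_and_build_hierarchy`, the use of per-point scene-flow vectors as the "estimates", and voxel-averaging as the coarse-to-fine construction are stand-ins to show the data flow.

```python
import numpy as np

def aggregate_and_build_hierarchy(ref_points, ref_flows, voxel_sizes=(0.4, 0.2, 0.1)):
    """Warp reference-frame points into the target frame using estimated
    per-point motion, then build a coarse-to-fine point hierarchy.

    ref_points: list of (N_i, 3) arrays, 3D points per reference frame.
    ref_flows:  list of (N_i, 3) arrays, estimated motion of each point
                from its reference frame into the target frame.
    Returns one (M, 3) array per hierarchy level, coarse to fine.
    """
    # Aggregate: warp every reference point into the target frame.
    warped = np.concatenate([p + f for p, f in zip(ref_points, ref_flows)])
    levels = []
    for v in voxel_sizes:
        # Quantize points into voxels of size v and average each voxel.
        keys = np.floor(warped / v).astype(np.int64)
        _, inv = np.unique(keys, axis=0, return_inverse=True)
        inv = inv.ravel()
        sums = np.zeros((inv.max() + 1, 3))
        np.add.at(sums, inv, warped)
        counts = np.bincount(inv).astype(float)[:, None]
        levels.append(sums / counts)
    return levels
```

Shrinking the voxel size from level to level yields progressively denser point sets, which is one simple way to realize a hierarchical point representation for the target frame.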
- Asia > Middle East > Israel (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.94)
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Jiangsu Province > Yancheng (0.04)
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Video Question Answering (VideoQA) has emerged as a vital tool for evaluating agents' ability to understand human daily behaviors. Despite the recent success of large vision-language models on many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events remains challenging.
- North America > United States > Pennsylvania (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)